Entropy
In information theory, entropy is a measure of uncertainty. Mathematically, for a random variable \(X\) with probability mass (or density) function \(p(x)\), it is defined as:
\[ \mathrm{H} (X) \coloneqq - \mathbb{E}[\log_b {p(X)}] \]
But why does entropy measure uncertainty?
Biased coins
Let’s play a simple game with a coin. We each put $1,000 into the pot — if I flip heads, you win the pot, otherwise I win the pot. If we use a fair coin, any sane person would think twice before playing the game — it is uncertain whether you will win or lose money. But say you know that the coin is biased and there’s a 99% chance of flipping heads. Of course you’d take the bet — it is relatively certain that you would win. Similarly, if you know that there’s a 1% chance of flipping heads, you would never take that bet — it is relatively certain that you would lose.
We can formalise this uncertainty with entropy. Letting \(X=1\) represent heads and \(X=0\) represent tails, we can calculate the entropy of the outcome of a flip of a fair coin:
\[ \begin{align} \mathrm{H} (X) &= -\sum_{x\in \mathcal{X}} p(x) \log{p(x)} \\ &= - p(0)\times \log{p(0)} - p(1) \times \log{p(1)} \\ &= -0.5\times \log{0.5} - 0.5 \times \log{0.5} \\ &= 0.6931 \; \mathrm{nats} \end{align} \]
Note that the choices of logarithm base \(b=2\), \(b=e\) and \(b=10\) correspond to the entropy units bits, nats and bans, respectively. In the above, and for the remainder of the post, we work in nats (i.e. \(b=e\)).
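The calculation above, and the effect of the base choice, can be sketched in a few lines of Python:

```python
import math

# Entropy of a fair coin flip: H(X) = -sum over x of p(x) * log p(x)
p = [0.5, 0.5]  # P(tails), P(heads)

h_nats = -sum(px * math.log(px) for px in p)    # base e  -> nats
h_bits = -sum(px * math.log2(px) for px in p)   # base 2  -> bits
h_bans = -sum(px * math.log10(px) for px in p)  # base 10 -> bans

print(f"{h_nats:.4f} nats, {h_bits:.4f} bits, {h_bans:.4f} bans")
# → 0.6931 nats, 1.0000 bits, 0.3010 bans
```

A fair coin carries exactly 1 bit of entropy, which is why bits are the natural unit when the base is 2.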
In the same way, we can calculate the entropy for all possible coin biases/probabilities of flipping heads, and we get a plot like this:
Entropy captures that uncertainty we feel making the bet. It peaks in the middle, which is when a fair coin is flipped. The more we stray from the centre, the more biased the coin is, and the less uncertain we are about the outcome of the coin flip.
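The curve in that plot is easy to reproduce. Here is a minimal sketch of the binary entropy function, evaluated at a few biases to show the peak at \(p = 0.5\) and the symmetry around it:

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a coin with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0  # a fully biased coin has no uncertainty
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Entropy peaks at p = 0.5 (fair coin) and falls off symmetrically
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"p = {p:.2f}: H = {binary_entropy(p):.4f} nats")
```

The 99%-heads coin from the betting game has an entropy of only about 0.056 nats — almost no uncertainty, which is exactly why the bet feels safe.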
Intuition behind the entropy formula
Let’s take a closer look at the equation for the entropy:
\[ \mathrm{H}(X) = \overbrace{\sum_{x\in\mathcal{X}}}^{\text{sum over all possible realisations of } X} \underbrace{p(x)}_{\text{probability of the realisation } x} \times \underbrace{(-\log{p(x)})}_{\text{how surprising } x \text{ is}} \]
A more precise way of thinking of entropy is as the expected ‘surprise’ from a random variable. Specifically, \(-\log{p(x)}\) represents the surprise for a particular realisation of \(X\).
As the probability of the realisation \(x\) increases, \(-\log{p(x)}\) decreases — in other words, the more likely it is that \(X\) takes on the value of \(x\), the less surprising it is when \(X\) takes on the value \(x\).
We then weight each possible realisation’s surprise by how likely it is to occur, and sum it all up to get the expected surprise of the random variable \(X\) — the entropy of \(X\)!
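This weighting can be made concrete with the 99%-heads coin from earlier. Tails is very surprising when it happens, but it almost never happens, so it contributes little to the expected surprise:

```python
import math

# Entropy as expected surprise: weight each outcome's surprise,
# -log p(x), by the probability p(x) that it occurs.
p = {"heads": 0.99, "tails": 0.01}

surprise = {x: -math.log(px) for x, px in p.items()}
entropy = sum(px * surprise[x] for x, px in p.items())

print(surprise)  # tails is far more surprising than heads...
print(entropy)   # ...but rarely occurs, so the expected surprise is small
```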
Note that the choice of \(-\log{p(x)}\) as the measure of surprise has an additional intuitive property. If we have two independent events \(A\) and \(B\) with probabilities of occurring \(p\) and \(q\), the probability of both of them occurring is \(pq\). The surprise from both of them occurring at the same time is:
\[ \begin{align} \mathrm{Surprise}(A \text{ and } B) &= -\log{\mathbb{P}(AB)} \\ &= -\log{pq} \\ &= -\log{p} - \log{q} \\ &= \mathrm{Surprise}(A) + \mathrm{Surprise}(B) \end{align} \]
Hence, the choice of \(-\log{p(x)}\) as the measure of surprise results in surprise being additive!
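A quick numerical check of this additivity, for two arbitrary independent event probabilities:

```python
import math

# Surprise is additive for independent events:
# -log(pq) = (-log p) + (-log q)
p, q = 0.3, 0.6  # probabilities of independent events A and B

surprise_joint = -math.log(p * q)
surprise_sum = -math.log(p) + -math.log(q)

print(surprise_joint, surprise_sum)  # equal, up to floating point
```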
All of the intuition above applies to continuous random variables too — the entropy of a continuous random variable is:
\[ \mathrm{H}(X) = \overbrace{\int_{\mathcal{X}}}^{\text{'sum' over all possible realisations of } X} \underbrace{p(x)}_{\text{'probability' of the realisation } x} \times \underbrace{(-\log{p(x)})}_{\text{how surprising } x \text{ is}} \; dx \]
In the interactive plot below, we have a normally distributed variable with a fixed mean and an adjustable standard deviation.
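For a normal distribution, this integral has the closed form \(\mathrm{H}(X) = \tfrac{1}{2}\log(2\pi e \sigma^2)\), which depends only on the standard deviation, not the mean. As a stand-in for the interactive plot, here is a sketch that evaluates the closed form and checks it against a direct numerical integration of \(-\int p(x)\log p(x)\,dx\):

```python
import math

def normal_entropy(sigma):
    """Differential entropy (nats) of N(mu, sigma^2): 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def normal_entropy_numeric(sigma, n=20_000):
    """Midpoint-rule approximation of -integral of p(x) log p(x) dx."""
    lo, hi = -10 * sigma, 10 * sigma  # the tails beyond +-10 sigma are negligible
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        total -= p * math.log(p) * dx
    return total

# Entropy grows with sigma: a wider distribution is more uncertain
for sigma in (0.5, 1.0, 2.0):
    print(sigma, normal_entropy(sigma), normal_entropy_numeric(sigma))
```

Widening the distribution (increasing \(\sigma\)) increases the entropy — exactly the behaviour the interactive plot lets you explore by dragging the standard deviation slider.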